Error Detection and Impact-Sensitive Instance Ranking in Noisy Datasets
Authors
Abstract
Given a noisy dataset, locating erroneous instances and attributes and ranking suspicious instances by their impact on system performance is an interesting and important research issue. In this paper we provide an Error Detection and Impact-sensitive instance Ranking (EDIR) mechanism to address this problem. Given a noisy dataset D, we first train a benchmark classifier T from D. The instances that cannot be effectively classified by T are treated as suspicious and forwarded to a subset S. For each attribute Ai, we switch Ai and the class label C to train a classifier APi for Ai. Given an instance Ik in S, we use APi and the benchmark classifier T to locate the erroneous value of each attribute Ai. To quantitatively rank instances in S, we define an impact measure based on the Information-gain Ratio (IR). We calculate IRi between attribute Ai and C, and use IRi as the impact-sensitive weight of Ai. The sum of the impact-sensitive weights of all located erroneous attributes of Ik gives its total impact value. The experimental results demonstrate the effectiveness of our strategies.
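The ranking step described in the abstract can be illustrated with a minimal Python sketch. It assumes discrete attributes and the standard definition of the information-gain ratio (information gain divided by the split information of the attribute); the abstract does not spell out the exact formula, so this is an assumption. The function names (`gain_ratio`, `impact_value`, `rank_suspicious`) and the toy data are hypothetical, and the set of erroneous attributes per instance is taken as given, since how APi and T are combined to locate them is not detailed here.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(attribute, labels):
    """Information-gain ratio IRi between one discrete attribute Ai and class C.

    Assumes the standard definition: information gain / split information.
    """
    total = len(labels)
    # Group the class labels by attribute value.
    groups = {}
    for a, c in zip(attribute, labels):
        groups.setdefault(a, []).append(c)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(attribute)
    # An attribute with a single value carries no information.
    return info_gain / split_info if split_info > 0 else 0.0


def impact_value(erroneous_attrs, weights):
    """Total impact of an instance: sum of the weights of its erroneous attributes."""
    return sum(weights[name] for name in erroneous_attrs)


def rank_suspicious(suspects, weights):
    """Return instance ids from S ordered by decreasing total impact value."""
    return sorted(suspects, key=lambda k: impact_value(suspects[k], weights),
                  reverse=True)


# Hypothetical toy dataset: two attributes and a binary class label.
columns = {"A1": ["x", "x", "y", "y"], "A2": ["p", "q", "p", "q"]}
labels = ["+", "+", "-", "-"]
weights = {name: gain_ratio(col, labels) for name, col in columns.items()}

# Hypothetical output of the error-location step: instance id -> erroneous attributes.
suspects = {"I1": ["A2"], "I2": ["A1", "A2"], "I3": []}
ranking = rank_suspicious(suspects, weights)
```

In this toy example A1 determines the class while A2 is independent of it, so an instance with an error in A1 ranks above one whose only error is in A2.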
Similar resources
Robustness of Threshold-Based Feature Rankers with Data Sampling on Noisy and Imbalanced Data
Gene selection has become a vital component in the learning process when using high-dimensional gene expression data. Although extensive research has been done towards evaluating the performance of classifiers trained with the selected features, the stability of feature ranking techniques has received relatively little study. This work evaluates the robustness of eleven threshold-based feature ...
Learning Homophily Couplings from Non-IID Data for Joint Feature Selection and Noise-Resilient Outlier Detection
This paper introduces a novel wrapper-based outlier detection framework (WrapperOD) and its instance (HOUR) for identifying outliers in noisy data (i.e., data with noisy features) with strong couplings between outlying behaviors. Existing subspace or feature selection-based methods are significantly challenged by such data, as their search of feature subset(s) is independent of outlier scoring ...
An Alternative Ranking Problem for Search Engines
This paper examines in detail an alternative ranking problem for search engines, movie recommendation, and other similar ranking systems motivated by the requirement to not just accurately predict pairwise ordering but also preserve the magnitude of the preferences or the difference between ratings. We describe and analyze several cost functions for this learning problem and give stability boun...
Intelligent and Fast Diagnosis of Heart Disease Based on the Synergy of Linear Neural Networks and Logistic Regression
Background and purpose: Diseases have been the greatest threat to human beings throughout history. Heart disease (HD) has gained special attention in medical studies. Recently, the classification and diagnosis of HD has become a key topic, and much research has been done to increase precision and reduce error in this type of decision. With development of intelligent learning syst...
Credit Card Fraud Detection using Data mining and Statistical Methods
Due to today’s advances in technology and business, fraud detection has become a critical component of financial transactions. Given the vast amounts of data in large datasets, it becomes more difficult to detect fraudulent transactions manually. In this research, we propose a combined method using both data mining and statistical tasks, utilizing feature selection, resampling and cost-...